ABSTRACT

This paper presents a higher-level representation for videos aimed
at video genre retrieval. A challenge in video genre retrieval is
that a video may comprise multiple categories; for instance, a news
video may be composed of sports, documentary, and action content.
It is therefore useful to encode the distribution of such genres
in a compact and effective manner. We propose to create a visual
dictionary using a genre classifier. Each visual word in the proposed
model corresponds to a region in the classification space determined
by the classifier’s model learned on the training frames. Therefore,
the video feature vector contains a summary of the activations of
each genre in its contents. We evaluate the bag-of-genres model for
video genre retrieval using the dataset of the MediaEval Tagging Task
of 2012. Results show that the proposed model improves the quality
of the representation while being more compact than existing features.

Index Terms: video genre retrieval, video representation,
visual dictionaries, semantics

1. INTRODUCTION 

The retrieval of videos by genre is a challenging application, as
videos may be composed of visually different excerpts. For instance,
a news video can comprise multiple categories, like sports,
documentary, health, and others. A video retrieval system aiming at
retrieving videos with similar content should be aware of this
property in order to obtain better results.

In this paper, we focus on video retrieval by genre based only
on visual information. No tags or textual descriptions are
considered. One important step in this scenario is feature extraction
from videos. There are mainly two kinds of feature descriptors for
videos: descriptors that consider motion and descriptors based on
isolated frames. Motion-based descriptors usually obtain space-time
interest points and extract histograms of those local points, or
obtain histograms of motion patterns [1]. Descriptors based on
isolated frames are usually derived from image feature extraction:
frames are represented individually and then a pooling function is
used to obtain the video feature vector. The advantage of the first
kind of descriptor is the encoding of transitions between frames.
The advantage of the second kind is the large number of descriptors
already proposed for image representation.

Regardless of motion, many of the state-of-the-art solutions for
feature extraction are based on visual dictionaries. Such
dictionaries are commonly based on local patches, which are
semantically poor. Therefore, both kinds of descriptors usually share
the same property: the video feature vector carries few semantics
from the human perspective.

In this paper, we present a novel approach for video representation
in video genre retrieval tasks, called Bag-of-Genres (BoG). The
proposed method is based on dictionaries of genres created from
genre classifiers. Each visual word in the BoG model is a
genre-labeled region of the classification space defined by the
classifier’s model. Thus, the final video representation corresponds
to a vector of the activations of its contents for each of the genres
in the dictionary. Therefore, each component of the representation
has self-contained semantics and is directly related to a specific
concept.

We validated the BoG model on the dataset of the MediaEval Tagging
Task of 2012. We evaluate the importance of the genre classifier
in the model as well as the quality of the BoG representation.
Although the genre classifier has low accuracy, the BoG model worked
well in the experiments. The results are comparable to the existing
baselines, even though BoG is much more compact.

2. RELATED WORK 

In this section, we describe related work, focusing on approaches
that are based on visual dictionaries and approaches that aim at
including semantics in the representation.

Many solutions in the literature aim at including semantics in the
representation. In some techniques, an image is represented as a
scale-invariant response map of a large number of pre-trained
generic object detectors [2], which can be seen as a dictionary of
objects. Poselets have been used similarly, as a dictionary of poses
for recognizing people’s poses [?]. Labeled local patches have also
been used to obtain a dictionary with more semantics [?]. Boureau et
al. [5] also present a way to supervise the dictionary creation.
Other approaches can also be considered related to the intention of
having dictionaries with more meaningful visual words [?].

The approach proposed here is closely related to the Bag-of-Scenes
(BoS) model [?], in which the video feature vector is an activation
vector of scenes. As scenes are more semantically meaningful than
local patches, the BoS feature space is semantically richer. Each
dimension in the BoS space corresponds to a semantic concept.

The main novelty of BoG in relation to previous work, especially
BoS, is that we use a genre classifier as the visual dictionary. In
the BoS model, the visual dictionary is based directly on the feature
vectors of the scenes. The advantage of using a classifier is that it
better delineates the frontiers among visual words and tends to be
more robust to feature dimensionality. Another advantage is the
compactness of the BoG vector, as its dimensionality corresponds
directly to the number of genres in the problem.

Fig. 1. An overview of the Bag-of-Genres model.


3. BAG OF GENRES 

In this section, we describe the Bag-of-Genres (BoG) model for
video representation. This model is based on a dictionary of genres,
in which each visual word corresponds to a decision region of
the classification model defined by a genre classifier. Thus, each
video is represented by a vector of activations of its frames for
each of the genres in the dictionary.

The main advantage of the BoG model is that it relies on elements
that carry more semantics according to human perception.
Traditional dictionaries based on local features, like SIFT or STIP,
are composed of visual words that carry no semantic information,
such as corners and edges [?]. In the BoG model, as the visual words
are genre-labeled regions of the classification space, the activation
vector has one dimension for each genre, making it simple to analyze
the presence or absence of each genre in a video.

Figure 1 shows a flowchart of the stages involved in representing
video content using the BoG model. At the top, we show how the visual
dictionary is created. At the bottom, we show how this codebook is
used to represent video content.

The creation of the visual dictionary is performed as follows.
Given a set of training videos with known genre labels, we first
discard much of the redundant information by taking only a subset of
the video frames. Sampling at fixed time intervals and summarization
methods [10, 11] are examples of possibilities for frame selection.
In this paper, frames were selected using the well-known FFmpeg tool
at a sampling rate of one frame per second. After that, we extract
features from each of the selected frames in order to encode their
visual content into feature vectors. Any image feature can be used,
for instance, color histograms, GIST, or bags of quantized SIFT
features. Then, those feature vectors and their associated genre
labels are used as input for training a genre classifier. The
obtained classification model represents the dictionary of genres
used for representing videos.
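The frame selection step can be sketched as follows. This is a minimal illustration, not the exact command used in the experiments: the `fps` video filter is a standard FFmpeg option, while the function names, the output file pattern, and the JPEG format are assumptions made for the example.

```python
import subprocess
from pathlib import Path


def build_sampling_command(video_path, output_dir, frames_per_second=1):
    """Build an FFmpeg command that samples frames at a fixed rate.

    The output pattern and image format are illustrative choices,
    not settings reported in the paper.
    """
    pattern = str(Path(output_dir) / "frame_%06d.jpg")
    return [
        "ffmpeg",
        "-i", str(video_path),               # input video
        "-vf", f"fps={frames_per_second}",   # keep this many frames per second
        pattern,                             # numbered output images
    ]


def sample_frames(video_path, output_dir, frames_per_second=1):
    """Run FFmpeg; requires the ffmpeg binary to be installed."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    command = build_sampling_command(video_path, output_dir, frames_per_second)
    subprocess.run(command, check=True)
```

Separating the command construction from its execution makes it possible to inspect the command without FFmpeg being available.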

After creating the visual dictionary, we represent videos
according to the dictionary space. Given an input video, we initially
apply frame selection and extract features from each selected frame.
After that, the feature vector of each frame must be coded according
to the dictionary of genres. Each feature vector is classified by the
genre classifier, which predicts a genre label for the frame. The
labeling process is analogous to the coding step of traditional
visual dictionaries [12]. Finally, a normalized frequency histogram
is obtained by counting the occurrences of each of the genre labels,
forming the bag-of-genres representation for the input video. This
step can be understood as pooling the frame genres [?].
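The pooling step can be illustrated with a short sketch. It assumes that the genre classifier has already produced one label per selected frame; the function name `bag_of_genres`, the three-genre dictionary, and the frame labels are hypothetical and serve only as an example.

```python
from collections import Counter


def bag_of_genres(frame_labels, genres):
    """Pool per-frame genre labels into a normalized frequency histogram.

    frame_labels: genre label predicted for each selected frame of one video.
    genres: ordered list of the genres known to the classifier; it fixes
            the order and the dimensionality of the output vector.
    """
    if not frame_labels:
        raise ValueError("a video needs at least one classified frame")
    counts = Counter(frame_labels)
    total = len(frame_labels)
    return [counts[genre] / total for genre in genres]


# Hypothetical predictions for a six-frame news video.
genres = ["documentary", "politics", "sports"]
predicted = ["sports", "politics", "sports", "documentary", "sports", "politics"]
bog = bag_of_genres(predicted, genres)
print(bog)
```

Each component of the resulting vector is the fraction of frames assigned to one genre, so the components sum to one and the vector length equals the number of genres.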

The dimensionality of the bag-of-genres feature space is directly 
related to the number of genres used for training the genre classifier 
during the dictionary creation. Therefore, as in many applications 
the number of genres is small, the bag of genres is usually more 
compact than existing features. 

4. EXPERIMENTS AND RESULTS 

Experiments were conducted on a benchmarking dataset provided
by the MediaEval 2012 organizers for the Genre Tagging Task [13].
The dataset is composed of 14,838 videos (3,288 hours) collected
from blip.tv and is divided into a training set of 5,288 videos
(36%) and a test set of 9,550 videos (64%). Those videos are
distributed among 26 video genre categories assigned by the blip.tv
media platform, namely (the numbers in parentheses are the total
number of videos): art (530), autos and vehicles (21), business (281),
citizen journalism (401), comedy (515), conferences and other
events (247), documentary (353), educational (957), food and drink
(261), gaming (401), health (268), literature (222), movies and
television (868), music and entertainment (1148), personal or
autobiographical (165), politics (1107), religion (868), school and
education (171), sports (672), technology (1343), environment (188),
mainstream media (324), travel (175), video blogging (887), web
development (116), and default category (2349, which comprises
videos that cannot be assigned to any of the previous categories).
The main challenge of this collection is the high diversity of genres,
as well as the high variety of visual contents within each genre
category [14, 15].


^ http://blip.tv (as of January 2015).

After frame selection (one frame per second), the training set has
3,943,375 frames and the test set has 7,273,996 frames. Different
image descriptors were evaluated for extracting features from
such frames. The descriptors for encoding color properties are: Auto
Color Correlogram (ACC) [16], Color Coherence Vector (CCV) [17],
Border/Interior pixel Classification (BIC) [18], and Global Color
Histogram (GCH) [19]. The texture descriptors evaluated are:
Generic Fourier Descriptor (GFD) [20] and Haar-Wavelet Descriptor
(HWD) [21]. For more details regarding those image descriptors,
please refer to [?].

The experiments are divided into two phases. The first one
evaluates the genre classifier. The second one evaluates the BoG
representation for video genre retrieval.

4.1. Evaluation of the genre classifier 

The evaluation of the genre classifier is important because the quality
of the final BoG vector depends on the quality of this classifier. If the
genre classifier assigns frames to the wrong genres, the BoG vector
will not reflect the correct distribution of video genres. This would
be similar to having a bad coding step in traditional visual
dictionaries of quantized local features: wrong visual word labels
would be assigned to image patches, resulting in a bad bag of visual
words. Therefore, the BoG model depends on a good genre classifier.

To create the visual dictionary, we trained a linear SVM (c =
1.0) using features extracted from the training videos. The genre
(label) of each training frame is the same as that of the video from
which it was extracted. The training of the genre classifier was
based on randomly selecting the same number N of frames per genre. We
varied N among 100, 500, and 800 frames per genre. The remaining
frames were used for testing. It is worth mentioning the number
of frames used in this evaluation: almost 4 million frames from the
training videos and more than 7.2 million frames from the test videos
(no frames of the test videos were used for training the genre
classifier). For running SVM, we used the LIBSVM package [?].
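The balanced selection of training frames can be sketched as below. This is only an illustration of the idea of drawing the same number N of random frames per genre and keeping the rest for testing; the function name `balanced_split`, the fixed seed, and the error handling for genres with too few frames are assumptions of the example.

```python
import random
from collections import defaultdict


def balanced_split(frame_genres, n_per_genre, seed=0):
    """Pick n_per_genre random frames per genre for training.

    frame_genres: list of genre labels, one per frame (a frame inherits
                  the label of its source video).
    Returns (train_indices, held_out_indices).
    """
    by_genre = defaultdict(list)
    for index, genre in enumerate(frame_genres):
        by_genre[genre].append(index)

    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    train = []
    for genre, indices in by_genre.items():
        if len(indices) < n_per_genre:
            raise ValueError(f"genre {genre!r} has only {len(indices)} frames")
        train.extend(rng.sample(indices, n_per_genre))

    chosen = set(train)
    held_out = [i for i in range(len(frame_genres)) if i not in chosen]
    return sorted(train), held_out
```

With 26 genres and N = 800, such a split would yield 20,800 training frames, a small fraction of the almost 4 million frames available.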

Figure 2 presents the classification accuracy for the evaluated
descriptors. Notice that the classification accuracies are low for all
the descriptors, creating a very challenging scenario for the BoG
model, as explained previously. The best results were obtained
for the SVM model learned on 800 training frames per class. This
model was used for representing the test videos using the BoG
approach in the following experiments.

4.2. Evaluation of the BoG representation 

The following experiments evaluate the BoG model for video genre
retrieval. Each video in the test set was represented by a bag of
genres using the genre classifiers learned in the training step. With
the BoG of each video, a given test video was used as a query against
the rest of the videos in the test set, which were ranked according
to the Euclidean (L2) distance between their BoGs. For each genre,
around five percent of the test videos were randomly selected and
used as queries. Five replications were performed in order to ensure
statistically sound results. Presented results refer to the average
scores and their respective 99% confidence intervals, which were
computed based on the mean and standard deviation of each replication.
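The ranking procedure can be sketched as follows. The video identifiers and the three-dimensional vectors are made up for illustration; in the experiments the vectors have one dimension per genre.

```python
import math


def euclidean(u, v):
    """Euclidean (L2) distance between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimensionality")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def rank_by_query(query_id, bogs):
    """Rank every other video by increasing distance to the query.

    bogs: dict mapping a video id to its bag-of-genres vector.
    """
    query = bogs[query_id]
    others = [vid for vid in bogs if vid != query_id]
    return sorted(others, key=lambda vid: euclidean(query, bogs[vid]))


# Toy three-genre vectors; ids and values are hypothetical.
bogs = {
    "q": [0.7, 0.2, 0.1],
    "a": [0.6, 0.3, 0.1],
    "b": [0.1, 0.1, 0.8],
    "c": [0.5, 0.2, 0.3],
}
ranking = rank_by_query("q", bogs)
print(ranking)
```

Videos whose genre distributions are closest to that of the query appear first in the ranked list.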

We compared the BoG approach against two baselines:
Histogram of Motion Patterns (HMP) [1] and Bag of Scenes (BoS) [?].
To make a fair comparison, these approaches were configured with
their best settings based on the results reported in [?]. The distance
function used for feature comparison is the Euclidean (L2) distance.

^ http://www.csie.ntu.edu.tw/~cjlin/libsvm/ (as of January 2015).

Fig. 2. Evaluation of the genre classifier. All descriptors generated
genre classifiers with low discriminative power (accuracy below 50%),
creating a challenging scenario for the BoG model.


The retrieval effectiveness was assessed using the precision at the
top 10 retrieved items (P10) and Mean Average Precision (MAP).

In Figure 3, we compare the BoG representations and the baseline
methods with respect to the MAP and P10 measures. As we can
observe, the performance of the BoG representations is better
considering the MAP measure. MAP is a good indication of the
effectiveness considering all positions of the obtained ranked lists.
P10, in turn, focuses on the effectiveness of the methods considering
only the first positions of the ranked lists.
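For reference, the two measures can be computed as in the sketch below. It assumes that each ranked list is given as a sequence of booleans indicating whether the retrieved video has the same genre as the query, and that the list covers the whole collection, so every relevant video appears in it. The function names are illustrative.

```python
def precision_at_k(relevance, k=10):
    """Fraction of relevant items among the first k ranked results.

    relevance: list of booleans in ranked order (True = same genre as query).
    """
    top = relevance[:k]
    return sum(top) / k


def average_precision(relevance):
    """Mean of the precision values at each rank where a relevant item occurs."""
    hits = 0
    precisions = []
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(relevance_lists):
    """Average of the per-query average precision values."""
    return sum(average_precision(r) for r in relevance_lists) / len(relevance_lists)
```

P10 only looks at the head of the list, whereas average precision rewards relevant items at every position, which is why the two measures can rank methods differently.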

The BoG approach achieved the best scores using BIC as the
frame descriptor (used as the basis for the genre classifier). Notice
that BoG_BIC performs better than the baseline methods for MAP;
however, the same does not happen for P10. BIC was the best descriptor
for the genre classifier in the test set (see Section 4.1), making it
also better for generating the BoG vector.

We also performed paired t-tests to verify the statistical
significance of the results. For that, the confidence intervals for the
differences between paired averages of each class were computed to
compare every pair of approaches. If the confidence interval includes
zero, the difference is not significant at that confidence level. If the
confidence interval does not include zero, then the sign of the
difference indicates which alternative is better.
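The decision rule above can be sketched as follows. This is a generic paired-difference confidence interval, not necessarily the exact computation used in the experiments. The critical value of the Student t distribution is passed in by the caller and is assumed to come from a table; the function names are illustrative.

```python
import math
import statistics


def paired_difference_interval(scores_a, scores_b, t_critical):
    """Confidence interval for the mean of the paired differences a - b.

    t_critical: two-sided Student t critical value for n - 1 degrees of
                freedom at the desired confidence level, taken from a
                table (for example, about 2.787 for 99% confidence and
                25 degrees of freedom).
    """
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two paired samples with at least two items")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean = statistics.mean(diffs)
    std_error = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean - t_critical * std_error, mean + t_critical * std_error


def verdict(interval):
    """Interpret the interval: zero inside means no significant difference."""
    low, high = interval
    if low <= 0.0 <= high:
        return "not significant"
    return "a is better" if low > 0.0 else "b is better"
```

An interval that lies entirely below zero indicates that the second approach is better, and an interval entirely above zero favors the first.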

Table 1 presents the 99% confidence intervals of the differences
between BoG_BIC (the best configuration of BoG) and the baseline
methods for the MAP and P10 measures. Notice that the confidence
intervals for BoG_BIC and BoS include zero and, hence, the
differences between those approaches are not significant at that
confidence level. On the other hand, BoG_BIC and HMP are not
statistically different for MAP, whereas BoG_BIC performs worse than
HMP for P10. HMP is based on motion information and, hence, it does
not consider visual properties of video frames in an independent
manner.

Figure 4 compares the individual scores obtained for each class
in terms of the MAP and P10 measures. It is interesting to note the
differences in responsiveness of the different approaches with
respect to each of the genres. For MAP, BoG_BIC performs better
than the baseline methods for 13 of the 26 classes. For
P10, BoG_BIC provides good discriminative power on genres like
“school and education” and “web development and sites”.

Fig. 3. Results for video genre retrieval comparing BoG with the
baselines in terms of MAP and P10. BoG_BIC obtained the best MAP
score.

Fig. 4. MAP and P10 scores obtained for each genre.

Table 1. Paired t-test comparing the best BoG configuration and
the baselines. The intervals for BoG_BIC and BoS cross zero,
indicating no statistical difference between the methods. For
BoG_BIC versus HMP, HMP is better for P10.

                        MAP                 P10
Approach            min.     max.       min.     max.
BoG_BIC - BoS      -0.018    0.018     -0.063    0.014
BoG_BIC - HMP      -0.074    0.007     -0.232   -0.079

The key advantage of the BoG model is its computational efficiency
in terms of space occupation and similarity computation time.
In our experiments, the BoG vector corresponds to a 26-bin
histogram, which represents a reduction of 74% in relation to the BoS
vector (100-bin histogram) and is two orders of magnitude smaller
than the HMP vector (6075-bin histogram), making our approach
more suitable for real-time processing.
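The size comparison can be checked with a few lines of arithmetic. The dimensionalities are the ones reported above; the helper name `size_reduction` is an assumption of this sketch.

```python
def size_reduction(new_dim, old_dim):
    """Relative reduction in dimensionality, as a fraction of old_dim."""
    return 1.0 - new_dim / old_dim


bog_dim, bos_dim, hmp_dim = 26, 100, 6075  # dimensionalities reported above

print(f"BoG vs BoS: {size_reduction(bog_dim, bos_dim):.0%} smaller")
print(f"HMP / BoG size ratio: {hmp_dim / bog_dim:.0f}x")
```

Since the cost of a Euclidean distance computation grows linearly with the number of dimensions, the same ratios apply to the time needed to compare two vectors.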

Although the effectiveness of the BoG approach is not superior to
that of the baseline methods, the obtained results show the potential
of the idea. As explained previously, the quality of the genre
classifier is important for the BoG quality. Our genre classifiers
obtained less than 50% accuracy in the training set and less than 30%
in the test set, probably limiting the quality of the BoG
representation. Another limitation is the dataset used. As all the
frames of a video have the same label, visually different frames may
be of the same genre, harming the classifier.


5. CONCLUSIONS 

In this paper, we presented a new video representation for video
genre retrieval, named Bag-of-Genres. This representation model
relies on a dictionary of genres, which is created from a genre
classification model learned on the training frames. Different from
traditional dictionaries based on local features (e.g., SIFT or STIP),
here, visual words correspond to genre-labeled regions of the
classification space. Therefore, each dimension of the feature space
spanned by such a model is associated with a semantic concept.

Our approach was validated on the dataset of the MediaEval Tagging
Task of 2012. Our experiments evaluated the importance of
the genre classifier in the model as well as the quality of the BoG
representation. In these experiments, the BoG model performed
well despite the low accuracy of the genre classifier. The results
demonstrated that our approach performs similarly to state-of-the-art
methods while using a much more compact representation.

There are several ways of improving the BoG model. For
instance, a smarter strategy for feature extraction and
classification may enable the creation of more informative visual
dictionaries and, hence, improve the video representation.

Future work includes the evaluation of other methods for feature
extraction, as well as an extensive study of classification
strategies to be used in the creation of visual dictionaries. We
would also like to evaluate the use of a dataset of scene images to
create the genre classifier.